Abstract

Background: Genetic markers can be used to identify and verify the origin of individuals. Motivation for the 
inference of ancestry ranges from conservation genetics to forensic analysis. High density assays featuring Single 
Nucleotide Polymorphism (SNP) markers can be exploited to create a reduced panel containing the most 
informative markers for these purposes. The objectives of this study were to evaluate methods of marker selection 
and determine the minimum number of markers from the BovineSNP50 BeadChip required to verify the origin of 
individuals in European cattle breeds. Delta, Wright's Fst, Weir & Cockerham's Fst and PCA methods for population 
differentiation were compared. The level of informativeness of each SNP was estimated from the breed specific 
allele frequencies. Individual assignment analysis was performed using the ranked informative markers. Stringency 
levels were applied by log-likelihood ratio to assess the confidence of the assignment test. 

Results: A 95% assignment success rate for the 384 individually genotyped animals was achieved with < 80, < 100, 
< 140 and < 200 SNP markers (with increasing stringency threshold levels) across all the examined methods for 
marker selection. No further gain in power of assignment was achieved by sampling in excess of 200 SNP markers. 
The marker selection method that required the lowest number of SNP markers to verify the animal's breed origin 
was Wright's Fst (60 to 140 SNPs depending on the chosen degree of confidence). Certain breeds required fewer 
markers (< 100) to achieve 100% assignment success. In contrast, closely related breeds require more markers 
(~200) to achieve > 95% assignment success. The power of assignment success, and therefore the number of SNP 
markers required, is dependent on the levels of genetic heterogeneity and pool of samples considered. 

Conclusions: While all SNP selection methods produced marker panels capable of breed identification, the power 
of assignment varied markedly among analysis methods. Thus, with effective exploration of available high density 
genetic markers, a diagnostic panel of highly informative markers can be produced. 



Background 

The identification and verification of the origin of indi- 
viduals is useful in a variety of biological contexts and 
the practical applications of individual assignment pro- 
tocols are extensive [1-3]. Topical issues in population, 
conservation and evolutionary biology can benefit from 
the inference of ancestry of individuals. In an applied 
context, genetic identification can shed light on issues 
such as the contribution of source populations in mixed 
fisheries [3,4], meat traceability or brand authentication 
[5], translocated or migrant individuals [6], structure 
and levels of discrimination amongst populations [7,8], 
anthropological forensic investigations [2] and tracking 
the trade routes of illegally poached animals [3]. 

Where there is sufficient genetic heterogeneity amongst 
populations, genetic markers can be used to identify and 
verify the origin of individuals [7]. The genetic markers 
customarily used in individual assignment studies have 
been hypervariable microsatellite loci (e.g. [4,5,7]). However, 
with the advent of genome-wide analytical technologies, 
microsatellites are now being widely replaced by 
Single Nucleotide Polymorphism (SNP) markers (e.g., [9]). 
SNPs are increasingly favoured as population genetic markers 
because they are highly abundant and widespread in 
the genome, homoplasy is virtually absent, methods to dis- 
cover markers are reliable and subsequent automated gen- 
otyping through assay design can be easily implemented 
[10,11]. Numerous SNPs have been identified in the gen- 
omes of domestic animals, for example, in the dog (> 2.5 
million) [12], chicken (~ 2.8 million) [13] and cattle (> 2 
million) [14]. This has led to the technological develop- 
ment of standard products commonly termed 'SNP Chips', 
which enable the rapid automated large-scale production 
of genomic data. SNP Chips are now commercially avail- 
able for many animal species (e.g., sheep, [15]; pigs, [16]) 
including the Illumina Bovine50SNP BeadChip (Illumina 
Inc., San Diego, CA) for cattle [17,18]. 

These new resources are highly informative; the Bovi- 
ne50SNP BeadChip has already been used in genetic 
studies investigating population genetic structure [19], 
mapping for marker assisted selection of economically 
important traits [20,21] and unravelling the patterns of 
signatures of selection [19,22]. 

Dense genome-wide data is valuable but is relatively 
costly to produce and time-consuming or computation- 
ally expensive to analyse; it is therefore often desirable 
to reduce the number of markers by screening and 
selecting according to their information content to cre- 
ate reduced panels for population genetic analyses 
[23,24]. Several statistical selection methods are available 
to determine which genetic markers contain the most 
information to discriminate among populations. The sta- 
tistic, delta, which measures allele frequency differences, 
is commonly used in the field of human genetics to 
assess marker information content [25,26]. Bowcock et 
al., [27] suggested that informative genetic markers may 
be identified using Wright's Fst [28] and its derivatives 
[29]. Principal Component Analysis (PCA) has also been 
more recently proposed as an alternative method to 
determine population informative SNP markers [24]. 
Other algorithms have been developed to optimize the 
combination of loci selected (e.g., BELS, [30] and refer- 
ences therein); however, these approaches are computa- 
tionally intensive and their execution may be 
prohibitively slow with large datasets. 
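
To make the frequency-based statistics above concrete, the following is a minimal sketch of how delta and Wright's Fst could be computed for a single biallelic SNP and averaged over all breed pairs to rank markers. It assumes equally weighted populations and the textbook heterozygosity form of Fst; the function names and the toy allele frequencies are hypothetical and are not taken from the study.

```python
from itertools import combinations


def delta(p1, p2):
    """Absolute allele-frequency difference between two populations."""
    return abs(p1 - p2)


def wright_fst_pair(p1, p2):
    """Wright's Fst for one biallelic SNP in two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # Assumption: a SNP that is monomorphic in both populations scores 0.
        return 0.0
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def mean_pairwise(stat, freqs):
    """Average a two-population statistic over all pairs of populations."""
    pairs = list(combinations(freqs, 2))
    return sum(stat(a, b) for a, b in pairs) / len(pairs)


def rank_snps(freq_table, stat):
    """Return SNP names ordered by decreasing mean pairwise informativeness."""
    scores = {snp: mean_pairwise(stat, f) for snp, f in freq_table.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical allele frequencies of three SNPs in three breeds.
freq_table = {
    "snp_a": [0.9, 0.1, 0.5],
    "snp_b": [0.5, 0.5, 0.5],
    "snp_c": [0.6, 0.4, 0.5],
}
ranking = rank_snps(freq_table, wright_fst_pair)
print(ranking)
```

Passing `delta` instead of `wright_fst_pair` to `rank_snps` ranks the same table by mean allele-frequency difference.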

The objective of this study was to examine methods 
for selecting population informative SNP loci. To 
achieve this we set out to determine the minimum number 
of SNP markers from the Illumina Bovine50SNP 
BeadChip (Illumina Inc., San Diego, CA) that is required 
for individual genetic assignment to discriminate a set of 
European cattle breeds (Table 1). This was approached 
in a two-stage manner. First, several SNP selection 
methods were evaluated to determine the genetic information 
content of each SNP marker and markers were 
ranked by decreasing level of informativeness for each 
of the methods. Second, the likelihood of assigning individual 
genotypes to their known breed origin was estimated 
by cumulatively increasing the number of SNP 
markers, according to the ranked estimates of each SNP 
marker's informativeness for each selection method. 

Results 

Comparison of the marker selection methods 

Frequency histograms of the level of genetic information 
in the SNP markers are shown for each selection 
method (Figure 1). A predominantly left-skewed distri- 
bution was produced for each selection method, except 
delta, which produced a fairly symmetric distribution. 
The majority of the markers contained low to medium 
levels of genetic information and a small proportion had 
high levels of genetic information (Figure 1). 

To assess the level of similarity of the estimates of 
genetic information contained in each SNP marker 
across the different selection methods, a Spearman's 
rank correlation was calculated between the different 
estimates from the selection methods. High levels of 
correlation were observed between delta, pairwise 
Wright's Fst, pairwise W&C's Fst and PCA (Table 2). 
Similarly, there was a substantial amount of overlap (> 
200) in the top ranked 500 SNP markers between these 
four selection methods (Table 2). In contrast, the level 
of correlation was lower between global Fst and the 
other selection methods (Table 2). There was far less 
overlap (< 200) in the top ranked 500 SNP markers 
between the global Fst estimates and the other selection 
methods (Table 2). 
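
The two comparisons reported in Table 2 can be sketched as follows: a Spearman rank correlation between the informativeness scores of two methods, and the size of the overlap between their top-ranked markers. This is an illustrative sketch only; the helper names and the four toy scores are hypothetical, and the study itself ranked 40,483 SNPs and compared the top 500.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def top_k_overlap(scores_a, scores_b, k):
    """Number of SNPs shared by the top-k lists of two methods."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b)


# Hypothetical informativeness scores from two selection methods.
scores_a = {"s1": 0.9, "s2": 0.5, "s3": 0.1, "s4": 0.3}
scores_b = {"s1": 0.8, "s2": 0.2, "s3": 0.1, "s4": 0.6}
snps = sorted(scores_a)
rho = spearman([scores_a[s] for s in snps], [scores_b[s] for s in snps])
shared = top_k_overlap(scores_a, scores_b, k=2)
print(rho, shared)
```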

To further explore the conflicting results produced by 
global Wright's and W&C's Fst, the observed breed 
allele frequencies for the top ranked 50 SNP markers 
for each selection method were displayed in a box-plot 
[Additional file 1: Supplemental Figure S1]. The boxplot 
is an effective visual representation of both the central 
tendency and dispersion of data. Delta, pairwise 
Wright's Fst, pairwise W&C's Fst and PCA selected 
SNP markers with median allele frequency between 0.2 
and 0.8 and with large interquartile ranges indicating a 
high level of dispersion amongst the observed allele 
frequencies [Additional file 1: Supplemental Figure S1]. In 
comparison, the majority of the top-ranked SNP markers 
selected by global Wright's Fst had median allele 
frequencies near 0 or 1 and low levels of dispersion. 
The global W&C's Fst resulted in the selection of SNPs 
with a higher level of dispersion amongst the observed 
allele frequencies than global Wright's Fst but, nonetheless, 
also included markers with quite a few outliers 
and smaller interquartile ranges than the other selection 
methods. The global Fst methods resulted in the 
selection of many SNP markers specific for a single 
most genetically distinct population. 
Table 1 Information on the breeds 

| | Breed | N | Animal resources of N | n | Purpose | Origin | Distribution | Sampling locality |
| 1 | Angus - British | 23 | several Scottish farms; majority different sires | 23 | Beef | Scotland (UK) | Global | UK |
| 2 | Angus - American | 6124 | Registered bulls and steers | 25 | Beef | Scotland (UK) | Global | USA |
| 3 | Brown Swiss | 74 | 24 HapMap* (3 trios); remaining no pedigree | 24 | Dairy | Switzerland | Alpine Europe, Americas | USA |
| 4 | Charolais | 135 | 26 HapMap* (3 trios); remaining registered | 25 | Beef | France | France, USA, Brazil, RSA | USA |
| 5 | Finnish Ayrshire | 444 | 215 unrelated; 17 paternal half-sib families with average of 13 progeny per sire | 10 | Dairy | Scotland (UK) | Global | Finland |
| 6 | Guernsey | 23 | 21 HapMap*; remaining unrelated | 21 | Dairy | Island of Guernsey (UK) | USA, UK, Oceania, RSA | UK |
| 7 | Hereford | 143 | 32 HapMap* (4 trios); remaining registered | 25 | Beef | UK | Global | USA |
| 8 | Holstein | 18904 | Registered | 25 | Dairy | Netherlands | Global | USA |
| 9 | Jersey | 93 | 28 HapMap* (3 trios); remaining registered | 28 | Dairy | Island of Jersey (UK) | Global | USA |
| 10 | Limousin | 1621 | All registered | 25 | Beef | France | France, UK, USA | USA |
| 11 | Norwegian Red | 21 | HapMap* (1 trio) | 21 | Dual Purpose | Norway | Norway | Norway |
| 12 | Piedmontese | 29 | 24 HapMap* (3 trios); remaining unrelated | 19 | Beef | Italy | Italy | Italy |
| 13 | Red Angus | 15 | Registered | 15 | Beef | Scotland (UK) | USA, Australia | USA |
| 14 | Red Poll | 23 | Registered, a few shared sires and dams | 23 | Beef | UK | | UK |
| 15 | Shorthorn | 108 | Registered (7 trios) | 25 | Dual Purpose | UK | Global | USA |
| 16 | Simmental | 777 | 104 sires; 673 steers from 24 sires | 25 | Beef | Switzerland | Global | USA |
| 17 | Welsh Black | 32 | several Welsh farms; unrelated | 25 | Beef | Wales (UK) | | UK |
| | Total: | 28589 | | 384 | | | | |

N, reference sample size (used to estimate the allele frequencies); * HapMap individuals are unrelated except where indicated by 'trio' [45]; n, number of 
individuals used in assignment testing. 



Assignment precision: overall assessment 

The accuracy of assignment of individual genotypes to 
known breed origin was evaluated by cumulatively add- 
ing 20 markers, in descending order of estimated marker 
informativeness for each selection method. No popula- 
tion genetic differentiation was detected between the 
American and British Angus populations (Table 1), con- 
sequently the two populations were pooled together and 
treated as a single breed in subsequent analyses. 
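
The cumulative assignment procedure can be illustrated with a short sketch. It assumes Hardy-Weinberg genotype probabilities at each SNP, independent markers, and an LLR defined as the difference in log10 likelihood between the most likely and the second most likely breed; it also clamps allele frequencies away from 0 and 1 with a hypothetical `eps` so that no genotype has zero probability. None of these details are confirmed by the text in this section, and all names and data are illustrative.

```python
import math


def genotype_log10_lik(genotypes, freqs, eps=0.01):
    """Log10 likelihood of a multilocus genotype given breed allele frequencies.

    genotypes: copies (0, 1 or 2) of the reference allele at each SNP.
    freqs: frequency of that allele in the breed, in the same SNP order.
    """
    total = 0.0
    for g, p in zip(genotypes, freqs):
        p = min(max(p, eps), 1.0 - eps)  # assumption: avoid log10(0)
        probs = ((1.0 - p) ** 2, 2.0 * p * (1.0 - p), p ** 2)
        total += math.log10(probs[g])
    return total


def assign(genotypes, breed_freqs, n_markers, threshold=0.0):
    """Assign using the first n_markers ranked SNPs; None if LLR <= threshold."""
    scores = {
        breed: genotype_log10_lik(genotypes[:n_markers], freqs[:n_markers])
        for breed, freqs in breed_freqs.items()
    }
    ordered = sorted(scores, key=scores.get, reverse=True)
    llr = scores[ordered[0]] - scores[ordered[1]]
    return (ordered[0] if llr > threshold else None), llr


# Hypothetical frequencies for four SNPs, already sorted by informativeness.
breed_freqs = {
    "breed_x": [0.9, 0.8, 0.1, 0.5],
    "breed_y": [0.1, 0.2, 0.9, 0.5],
}
animal = [2, 2, 0, 1]

for n in (1, 2, 3, 4):
    print(n, assign(animal, breed_freqs, n, threshold=3.0))
```

With a single marker the animal is left unassigned at the LLR > 3 threshold, and adding further ranked markers raises the LLR until the assignment is accepted, which mirrors the trade-off between stringency and the number of markers required.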

The success of assignment of the 384 individual genotypes 
to breed of origin at the four stringency level 
thresholds for four of the selection methods (delta, pairwise 
Wright's Fst, pairwise W&C's Fst and PCA) is presented 
in Figure 2. Strikingly, > 50% assignment success 
is achieved for all selection methods at stringency level 
LLR > 0 using just the first 20 SNP markers. Overall, pairwise 
Wright's Fst required the smallest number of SNP markers 
to reach 90%, 95% and 98% correct assignment at 
the four stringency threshold levels (Table 3). Of the 
four selection methods, PCA was the poorest performer, 
requiring > 190 SNP markers to attain 95% assignment 
success at LLR > 3 (Figure 2; Table 3). The power of assignment 
using PCA as a selection method decreased considerably 
across all the stringency thresholds when a 98% assignment 
success was imposed (Figure 2; Table 3). 

Full results are not shown for assignment precision 
using SNP markers ranked by the global Fst methods because they 
performed comparatively poorly. For global Wright's 
Fst, 90% assignment success was obtained with 230 and 
380 SNP markers at the stringency levels of LLR > 0 
and LLR > 3, respectively. Using up to 400 markers, 
95% assignment success was not achieved at any strin- 
gency level. For global W&C's Fst, 90% assignment suc- 
cess was obtained with 80 and 230 SNP markers at the 
stringency levels of LLR > 0 and LLR > 3, respectively. 
The global W&C's Fst had greater assignment accuracy 
over global Wright's Fst, but still performed worse than 
the other four selection methods (Table 3). 

Randomly chosen SNP sets performed worse than 
ranked informative SNP markers in individual assignment 
analysis (Figure 2). Neither an asymptote nor 95% assignment 
success was reached using up to 400 markers (average 
across 20 sets of randomly chosen SNPs at LLR > 3). 

Figure 1 Frequency histograms of the estimates of genetic information contained in each SNP marker, for each selection method (x-axis 
scale is method-specific). The majority of the SNP markers display low to moderate estimates of genetic informativeness with few 
markers displaying high levels of population differentiation. 

Individual assignment analysis using a training set and 
a holdout set was performed in order to evaluate the 
power of assignment for samples not included in the 
reference population. This cross-validation analysis 
reported slightly worse power of assignment than the 
main analysis [Additional file 1: Supplemental Figure 
S2]. The assignment power for breeds with large sample 
sizes (N > 50) was comparable to the results of the main 
analysis (results not shown). However, certain breeds 
with a low sample size had worse assignment power in 
the cross-validation analysis. For example, poor assignment 
power was observed in Red Angus and Norwegian 


Table 2 Comparison of the SNP selection methods 

| | delta | Global Wright's Fst | Pairwise Wright's Fst | Global W&C's Fst | Pairwise W&C's Fst | PCA [1:8] |
| delta | | 0.589 | 0.884 | 0.370 | 0.819 | 0.928 |
| global Wright's Fst | 98 | | 0.847 | 0.462 | 0.821 | 0.682 |
| pairwise Wright's Fst | 381 | 151 | | 0.448 | 0.952 | 0.888 |
| global W&C's Fst | 59 | 49 | 63 | | 0.461 | 0.408 |
| pairwise W&C's Fst | 306 | 156 | 367 | 67 | | 0.810 |
| PCA [1:8] | 273 | 101 | 274 | 66 | 229 | |

The upper triangle contains the Spearman's rank correlations between the selection methods, each of which ranked the 40,483 SNPs for information content. The 
lower triangle contains the amount of overlap for the top 500 ranked SNP markers between each pair of selection methods. 
Figure 2 The percentage assignment success with cumulative number of top-ranked SNP markers at the 4 stringency threshold levels, 
for each selection method. 70% success was achieved with the first 20 SNP markers across all ranking methods; power of assignment did not 
increase beyond 200 SNP markers. Average assignment success across 20 sets of randomly selected markers is also shown for the LLR >3 
stringency threshold level. 



Red, two breeds of low sample size and for which clo- 
sely related breeds were included in the dataset (Angus 
and Finnish Ayrshire, respectively) (results not shown). 

Assignment precision: individual breeds 

The SNP selection methods differed for power of assignment 
in individual breeds, but no one method consistently 
outperformed any other in all breeds (Table 4). 
No substantial further gain in power of assignment in 
individual breeds was observed beyond ~ 200 SNP markers. 
Certain breeds required relatively few SNP markers 
to attain > 95% assignment success (Table 4). For example, 
the Jersey breed required < 50 SNPs to achieve 
100% individual assignment, even when strict stringency 
levels were applied. In contrast, the Charolais breed 
required ~100 SNP markers to achieve > 95% individual 


Table 3 Individual assignment performance for the four selection methods 

| Log10 (LLR >) | delta 90% | delta 95% | delta 98% | pairwise Wright's Fst 90% | pairwise Wright's Fst 95% | pairwise Wright's Fst 98% | pairwise W&C's Fst 90% | pairwise W&C's Fst 95% | pairwise W&C's Fst 98% | PCA 90% | PCA 95% | PCA 98% |
| 0 | 42.47 | 59.94 | 86.48 | 40.25 | 57.44 | 83.72 | 36.53 | 62.89 | 103.07 | 50.58 | 75.52 | 116.07 |
| 1 | 67.97 | 90.99 | 129.36 | 60.12 | 80.50 | 114.45 | 64.37 | 89.21 | 129.27 | 71.85 | 98.36 | 152.26 |
| 2 | 95.62 | 126.63 | 179.26 | 80.02 | 104.62 | 147.79 | 89.63 | 119.13 | 171.29 | 101.62 | 139.54 | 283.40 |
| 3 | 123.46 | 159.05 | 209.70 | 105.41 | 137.29 | 195.69 | 120.04 | 159.83 | 241.62 | 139.72 | 192.40 | 403.89 |

Estimated number of SNP markers required to achieve 90%, 95% and 98% correct assignment at the four stringency thresholds for each SNP selection method 
(the individuals from the two Angus populations are pooled). Numbers estimated from asymptotic regression equation. 
assignment and power was severely compromised with 
increasing stringency level. 

There was a significant positive correlation between 
the percentage of correctly assigned individuals and a 
breed's average level of genetic differentiation (Figure 3; 
Spearman's rank correlation, rho = 0.635, p = 0.0082). 

Type I (false positives) and II errors (false negatives) that 
occurred in the individual assignment analysis, using pairwise 
Wright's Fst at the lowest stringency threshold level 
(LLR > 0), were calculated [Additional file 1: Supplemental 
Table S1]. Using 50 SNP markers, 5 breeds were assigned 
with 100% assignment success, and the remaining breeds 
had type I errors of < 15%. The type I error rate was highest 
for Angus (14.6%), followed closely by Red Angus 
(13.3%); individuals of either breed that were not assigned 
to their correct origin were assigned to the other breed. Using 50 
SNP markers, eight breeds had no individuals assigned 
from other breeds, and the remaining breeds displayed a 
type II error of < 17%. The exception was the Red Angus breed, 
where 35% of the assigned individuals were Angus; 
this may have been inflated by the relatively low sample 
size of the Red Angus breed (15) compared to Angus (41). 
The type I and II error rates decreased to < 5% by 150 SNP 
markers. 
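
The per-breed error rates described above can be tabulated from paired lists of true and assigned breeds. The sketch below follows the usage in this paragraph: the type I rate of a breed is the share of its individuals that were not assigned to it, and the type II rate is the share of individuals assigned to the breed that actually belong to another breed. The function name and the toy data are hypothetical.

```python
from collections import Counter


def error_rates(true_breeds, assigned_breeds):
    """Return {breed: (type_I_rate, type_II_rate)} from paired labels."""
    n_true = Counter(true_breeds)
    n_assigned = Counter(assigned_breeds)
    n_correct = Counter(
        t for t, a in zip(true_breeds, assigned_breeds) if t == a
    )
    rates = {}
    for breed in n_true:
        # Share of the breed's own individuals that went elsewhere.
        type_one = 1.0 - n_correct[breed] / n_true[breed]
        # Share of individuals assigned to the breed that came from elsewhere.
        if n_assigned[breed]:
            type_two = 1.0 - n_correct[breed] / n_assigned[breed]
        else:
            type_two = 0.0  # assumption: nothing assigned means no such error
        rates[breed] = (type_one, type_two)
    return rates


# Hypothetical outcome for six animals from two closely related breeds.
true_breeds = ["A", "A", "A", "A", "B", "B"]
assigned_breeds = ["A", "A", "A", "B", "B", "A"]
rates = error_rates(true_breeds, assigned_breeds)
print(rates)
```

In this toy case the smaller breed B shows the larger rates (one of its two animals is lost and one of the two animals assigned to it is foreign), the same sample-size effect suggested above for Red Angus.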

Ascertainment bias 

The SNP markers on the BovineSNP50 BeadChip were 
discovered through various breed sources. The majority 
of the markers were discovered from Angus, Holstein 
and Hereford breeds (others included Charolais, Limousin, 
Red Angus, Simmental, Jersey and Norwegian 
Red, but fewer SNPs were found through these 
breeds) [18]. The inclusion of few representative sources 
could influence the level of SNP informativeness and 
individual assignment power, such that breeds used in 
the discovery process show higher SNP variability. 
Although Jersey was one of the breeds used for SNP discovery, 
it had the lowest average minor allele frequency 
(MAF) (Table 5). MAF values for Angus, Hereford and 
Holstein were relatively high but lower than for Charolais 
and Simmental. The power of assignment at a breed 
level revealed that the breeds represented during the 
SNP discovery process were not amongst those (except 
for Jersey) that required comparatively fewer markers to 
achieve 100% assignment success (Table 4). 

The top 500 SNP markers ranked by decreasing informativeness 
were listed with their corresponding SNP 
discovery method (7 in total, [18]) [Additional file 2: 
Supplemental Table S2]. A χ²-test revealed that the proportions 
of SNP discovery methods represented in the 
pairwise Wright's Fst 500 top SNP markers [Additional 
file 2: Supplemental Table S2] were not significantly different 
from those of the overall BovineSNP50 set (χ², df 
= 36, NS). 



Discussion 

The principal goal of this study was to evaluate marker 
selection methods and determine the minimum number 
of SNP markers from the BovineSNP50 BeadChip 
required to effectively and confidently assign individual 
genotypes to European cattle breeds. While all SNP 
selection methods yielded reduced marker panels cap- 
able of breed identification, the power of assignment 
varied markedly among analysis methods. 

Behaviour of the marker selection methods 

The pairwise Wright's Fst selection method marginally 
outperformed other selection methods in the individual 
assignment analysis (Table 3, Figure 2). Nonetheless, 
three other selection methods, delta, pairwise W&C's 
Fst and PCA, did not perform poorly at ranking mar- 
kers or for assignment success rates. Across these selec- 
tion methods, to achieve 95% assignment success, < 80, 
< 100, < 140 and < 200 SNP markers were required at 
the stringency threshold levels of LLR > 0, LLR > 1, 
LLR > 2 and LLR > 3, respectively (Table 3, Figure 2). 
These four selection methods (delta, pairwise Wright's 
Fst, pairwise W&C's Fst and PCA) to a large extent 
agreed on the most informative SNP markers. The 
resulting estimates of genetic informativeness of each 
SNP marker were highly correlated across the four 
selection methods and there was a large degree of overlap 
among the top-ranked 500 SNP markers (Table 2). 
This was to be expected because all methods were 
applied to individual SNP marker allele frequencies. In 
addition, it has been demonstrated that delta and 
Wright's Fst function similarly [31]. However, PCA 
exhibited the poorest correlation with the other meth- 
ods and lowest overall individual assignment power. 
Paschou et al., [24] advocated using PCA to determine 
marker informativeness because PCA renders an overall 
estimate for a SNP marker, as compared with other 
selection methods where it is necessary to estimate an 
average from pairwise calculations when the number of 
populations (K) > 2. PCA is an approach used to charac- 
terise the structure of a set of variables (in this case 
SNPs). The inferred relationships between objects (e.g., 
populations/breeds) are determined by the structure of 
the covariance matrix between the marker allele fre- 
quencies. Thus, the informativeness of a given marker 
will depend on the other markers included in the analy- 
sis and this could influence the informative markers that 
PCA identified. In contrast, delta and Fst do not take 
into account the relationships amongst markers and the 
level of information of each marker is estimated inde- 
pendently of the others. 

The remaining two selection methods, global Wright's 
and W&C's Fst, performed comparatively poorly in the 
individual assignment test. As similarly observed by 
Table 4 Power of assignment in individual breeds 

| Breed | Markers | delta log0 | delta log1 | delta log2 | delta log3 | pairwise Wright's Fst log0 | pairwise Wright's Fst log1 | pairwise Wright's Fst log2 | pairwise Wright's Fst log3 | pairwise W&C's Fst log0 | pairwise W&C's Fst log1 | pairwise W&C's Fst log2 | pairwise W&C's Fst log3 | PCA log0 | PCA log1 | PCA log2 | PCA log3 |
| Angus | 50 | 100 | 79.17 | 66.67 | 33.33 | 85.4 | 64.58 | 43.75 | 18.75 | 93.8 | 81.25 | 37.5 | 16.67 | 85.4 | 68.75 | 52.08 | 20.83 |
| | 100 | 100 | 91.67 | 77.08 | 72.92 | 97.9 | 91.67 | 89.58 | 87.5 | 100 | 100 | 95.83 | 91.67 | 89.6 | 79.17 | 77.08 | 50.42 |
| | 200 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97.92 | 97.92 | 100 | 97.92 | 97.92 | 97.92 |
| | 300 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 97.92 | 97.92 |
| | 400 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Brown Swiss | 50 | 100 | 95.8 | 95.8 | 95.8 | 100 | 100 | 100 | 95.8 | 100 | 100 | 100 | 91.7 | 100 | 100 | 100 | 100 |
| | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | 200 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | 300 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | 400 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Charolais | 50 | 72 | 56 | 24 | [illegible] | 92 | 76 | 60 | 24 | 92 | 60 | 44 | 16 | 88 | 68 | 20 | 4 |
| | 100 | 88 | 76 | 56 | 24 | 96 | 96 | 84 | 60 | 95 | 88 | 80 | 44 | 92 | 92 | 84 | 52 |
| | 200 | 96 | 96 | 96 | 92 | 96 | 96 | 92 | 92 | 96 | 95 | 96 | 92 | 95 | 96 | 92 | 84 |
| | 300 | 100 | 96 | 96 | 96 | 96 | 96 | 96 | 92 | 95 | 95 | 96 | 96 | 95 | 96 | 96 | 96 |
| | 400 | 100 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 95 | 95 | 96 | 96 | 95 | 96 | 96 | 96 |
| Finnish Ayrshire | 50 | 100 | 60 | 20 | 10 | 100 | 90 | 60 | 40 | 70 | 70 | 50 | 50 | 100 | 100 | 70 | |

Percentage of individuals that were successfully assigned to their breed origin, at the 4 stringency threshold levels, for each selection method. 



Kersbergen et al. [32], global Fst may not be appropriate 
to assess the level of genetic information in SNP 
markers when K > 2, as the method could result in the 
selection of SNP markers which are specific to distinct 
populations [Additional file 1: Supplemental Figure S1]. 
The selected SNP markers that were specific for only 
the most distinct breed were not segregating in the 
majority of the other breeds [Additional file 1: Supplemental 
Figure S1], and thus the expected heterozygosity 
would be low. Indeed, it is suggested that genetic markers 
with high expected heterozygosity are informative 
and therefore useful in individual assignment analysis 
[15,33], such as those identified using pairwise Wright's 
Fst, delta, pairwise W&C's Fst and PCA. As a result the 
performance of individual assignment tests using global 
Fst selected markers may be compromised compared to 
the other selection methods. Consequently, when K > 2 
it is preferable to estimate Fst, either Wright's or 
W&C's, on a population pairwise basis and then estimate 
the average across the pairwise comparisons to 
obtain an overall estimate for a marker. 
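
A small numerical sketch shows why the global and averaged pairwise estimates can rank markers differently when K > 2. It compares a SNP that is fixed in one breed and absent from the rest with a SNP that separates two groups of breeds, using the textbook heterozygosity form of Wright's Fst with equally weighted populations. The sixteen toy populations, the frequencies and the convention of scoring a pair with no variation as 0 are assumptions made for illustration only.

```python
from itertools import combinations


def fst(freqs):
    """Wright's Fst for one biallelic SNP across equally weighted populations."""
    p_bar = sum(freqs) / len(freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        # Assumption: no variation in this set of populations scores 0.
        return 0.0
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def mean_pairwise_fst(freqs):
    """Average of the two-population Fst over all pairs of populations."""
    pairs = list(combinations(freqs, 2))
    return sum(fst(pair) for pair in pairs) / len(pairs)


# Hypothetical allele frequencies in 16 breeds.
private_snp = [1.0] + [0.0] * 15    # fixed in one breed, absent elsewhere
spread_snp = [0.8] * 8 + [0.2] * 8  # separates two groups of breeds

for name, freqs in (("private", private_snp), ("spread", spread_snp)):
    print(name, round(fst(freqs), 3), round(mean_pairwise_fst(freqs), 3))
```

The global estimate places the private SNP first, although it distinguishes only one breed from the other fifteen, whereas the averaged pairwise estimate favours the SNP that is informative for many breed pairs.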

Assignment precision: minimum number of markers required 

Since pairwise Wright's Fst outperformed the other 
selection methods (Table 3) this selection method was 
subsequently adopted to estimate the minimum number 
of SNP markers required to achieve the desired assign- 
ment success. At the most commonly used stringency 
threshold (LLR > 0) and the accepted level of appropri- 
ate assignment success (95%) [34], < 60 SNP markers 
were required for the correct assignment of the 384 
individual genotypes. When stricter stringency threshold 
levels are applied, the number of SNP markers required 
to attain 95% assignment success increased (Table 3). 
Depending on the chosen degree of confidence, the 
required number of markers ranges from 60 to 140 
SNPs (80, 105 and 140 at LLR > 1, LLR > 2 and LLR > 
3, respectively). While the percentage of assignment success 
decreases with increasing stringency thresholds, so 
too does the risk of false assignment. Consequently, 
there is greater confidence in the estimated genotype 
likelihoods and LLR calculations if a strict stringency 
threshold (LLR > 3) is adopted. 

Figure 3 Scatterplot of average pairwise breed genetic differentiation against percentage correct assignment using the 
top-ranked 20 SNP markers (Wright's Fst method; Spearman's rank correlation, r = 0.635). 

It is difficult to compare the results obtained here to 
other studies conducted on individual assignment analy- 
sis in cattle breeds. First, most previous studies used 
microsatellite markers and, second, these studies had 
only a limited number of markers (e.g., [5,8]). These studies 
also primarily focused on the practicality of assigning 
individuals among cattle breeds with the available 
markers and were not concerned with how many 
markers would be required to achieve confident assign- 
ment of individual genotypes. In a study of French cattle 
breeds, Maudet et al., [8] found that using 23 microsa- 
tellite loci > 93% of individuals could be assigned to 
their breed origin. A more recent study used SNP mar- 
kers but did not have a large dataset at their disposal 
and could, again, only address the practicality of indivi- 
dual assignment with the limited set of available mar- 
kers [9]. Using 90 SNP markers genotyped in 24 
European cattle breeds they were able to correctly assign 
85% of individuals to their breed origin. McKay et al., 
[35] used STRUCTURE to assess the number of loci 
Table 5 Average minor allele frequency for each breed 
across the 40,483 SNP markers 

| Breed | MAF |
| [name missing] | 0.230 |
| [name missing] | 0.218 |
| Red Poll | 0.215 |
| [name missing] | 0.204 |
| Simmental | 0.244 |
| Welsh Black | 0.221 |



required to estimate the number of ancestral popula- 
tions in 6 Bos taurus breeds. The use of 150 randomly 
chosen loci (from a dataset of 2,641 loci) yielded the 
correct number of clusters in only 40% of cases, consistent 
with the reduced assignment power for randomly 
selected markers found in the current study (Figure 2). 
The lower assignment power in those studies was most 
probably a direct consequence of using an insufficient 
number of informative loci. The comparatively high 
assignment power of fewer SNP markers in the current 
study was probably due to the availability of > 40,000 
SNP markers and the benefit of selecting markers that 
contain the most genetic information with respect to 
the reference populations. Only a few highly poly- 
morphic microsatellite loci are required in individual 
assignment studies. However, dense SNP panels are now 
available for many species and SNP markers possess 
numerous advantages, including cost, throughput and 
reliability, making them a favourable choice over 
microsatellites. 

Assignment success: individual breeds 

It is evident that certain breeds in this study require far
fewer markers to achieve > 95% assignment success
than others, regardless of the selection method used
(Table 4, Figure 3). For example, the Jersey, Brown
Swiss, Guernsey and Piedmontese breeds achieved 100%
assignment success, even at stricter stringency thresholds,
using 50 SNP markers (pairwise Wright's Fst, LLR
> 2, Table 4). In contrast, the French breeds like the
Charolais, Limousin and Simmental achieved ~ 90%
assignment success at LLR > 0, which fell to < 50% with
increasing stringency threshold using 50 SNP markers
(Table 4). Similarly, the breeds that exhibited a lower
power of assignment success (Table 4) also had higher
type I and II error rates (Table S1).

A problem associated with the use of SNP markers in 
population genetics is ascertainment bias, which could 
influence population genetic estimates and may contri- 
bute to differences in assignment performance for indi- 
vidual breeds [10]. Heterogeneity amongst sample 
representatives can introduce ascertainment bias and 
breeds not included in the SNP discovery process could 
have lower minor allele frequencies (MAF) [15,36]. The
average MAF was lowest in the Brown Swiss, Guernsey
and Jersey breeds (Table 5), one of which was represented
in the SNP discovery process. Moreover, the three
breeds that were central to the process (Angus, Hereford,
Holstein) did not have the highest average MAF
values. In addition, no one particular SNP discovery
method was over-represented in the top identified SNP
markers [Additional file 2: Supplemental Table S2], as
the discovery method proportions were similar to those
represented on the BovineSNP50 assay [18]. SNP ascertainment
bias would have been more pronounced if B. t.
indicus breeds had been included in this study [36].
Morin et al., [10] concluded that ascertainment bias 
may be an issue in the assessment of population size 
and demographic changes. It is least important for indi- 
vidual identification and assignment tests, where the 
intentional selection of informative markers provides 
greater power than do randomly chosen markers. 

A factor that could affect the power of assignment suc- 
cess and variation in power of assignment between breeds 
is the level of pairwise genetic differentiation amongst the 
breeds. It is known that the number of markers required 
to obtain a high accuracy of assignment is influenced by 
the level of population genetic differentiation [8,37]. That 
is, it depends closely on the populations under considera- 
tion and respective levels of genetic heterogeneity. As 
demonstrated in Figure 3, the level of genetic differentiation
of a breed, measured by Fst, is correlated with power
of assignment success. Low breed genetic differentiation
was observed in Charolais and Simmental, which similarly
showed higher rates of Type I and II errors (Figure 3,
[Additional file 1: Supplemental Table S1]). False positive
assignments also occurred between breeds of known
recent ancestry, for example, Angus and Red Angus, and
Finnish Ayrshire and Norwegian Red [36]. In addition,
cases of mistaken assignment occurred between Charolais,
Simmental, Limousin and Shorthorn, where the pairwise
Fst values amongst these breeds were < 0.1. In a study on
individual assignment using microsatellites, Ciampolini et
al., [5] reported that of the four breeds under consideration,
Charolais and Limousin had the lowest level of pairwise
genetic differentiation and were the most difficult to
discriminate between (Fst = 0.041). As assignment success
is a function of both the number of markers and population
genetic differentiation, the level of breed genetic differentiation
is indicative of the potential number of SNP
markers necessary to attain high levels of power in individual
assignment tests [6,37].

Informative marker panels in population genetics 

Evaluation of the selection methods revealed that only a 
small proportion of the markers from the BovineSNP50 
BeadChip were highly informative for discriminating 
among 17 breeds, and the majority contained medium 
to low levels of genetic information (Figure 1). This is 
consistent with the development of the assay in which
SNPs with high MAF across B. t. taurus breeds were
preferentially selected in the assay design. Consequently,
sets of randomly chosen SNP markers contained sufficient
genetic information to produce moderate levels of
individual assignment power (Figure 2). In
contrast, a substantially reduced set of highly informative
SNP markers was capable of precisely discriminating
amongst the European cattle breeds (Figure 2).

Studies have shown that a reduced set of selected 
informative markers can effectively capture the genetic 
structure of human populations [23,24]. For instance, 
Lao et al., [23] found that 10 SNP markers from a 10K
SNP array contained enough genetic information to differentiate
individuals from Africa, Europe, Asia and
America and additional loci contributed very little extra 
information. Indeed, it is generally considered that unin- 
formative markers (i.e., monomorphic loci) may add 
noise to the results and compromise power of popula- 
tion genetic studies [38,39]. It could be useful to create 
a minimum panel of maximum power, particularly when 
using Bayesian genotypic clustering software such as 
STRUCTURE to elucidate population structure, because 
these approaches are computationally demanding (which 
intensifies as the number of markers increases) [23]. 
Consequently, it is practical and cost-effective to apply a 
selection method to dense assays to isolate the highly 
diagnostic markers and increase the power of analysis. 

The number of markers required for population 
assignment will depend on the species, the populations 
under consideration, their respective level of genetic dif- 
ferentiation and the desired stringency of assignment. 
For instance, within dogs 27% of the genetic variation is 
found between breeds, whereas for humans the level 
between populations is only 5%-10% [40]. As a result, 
the number of SNP markers required for individual 
assignment and discrimination amongst populations 
(breeds) will differ between species under consideration. 

Conclusion 

Although the marker selection methods explored in this 
study agreed to a large extent on which SNPs were the
most informative, there were significant differences in
the power of assignment produced by the resulting 
ranked SNP panels, with pairwise Wright's Fst outper- 
forming all other approaches. These results illustrate 
that with effective exploration it is possible to identify 
the most informative markers and produce an optimal 
minimum set of markers that can differentiate among 
populations. 

Methods 

Data 

Allele frequencies from 17 cattle breeds representing the
'reference' populations and a total of 384 individual genotypes
of known breed origin, sampled from the reference
populations, were available; details of the sampling
are given in Table 1. Decker et al., [36] selected 40,843 SNPs from
the BovineSNP50 BeadChip after a strict quality
screening in which "Loci selected for analysis were all
located on autosomes, had a call rate of at least 80% in
36 (75%) B. t. taurus breeds, and were not monomorphic
in all breeds....". Since only B. t. taurus breeds
were used in the current study, the set of SNP
markers selected by Decker et al., [36] was adopted. Detailed
information on the genotyping procedure can be found
in Decker et al., [36].

Selection methods to determine the most informative 
markers 

The breed-specific allele frequencies for the 40,843
SNPs were used to estimate the genetic information 
contained in each SNP marker using the following selec- 
tion methods: delta, Wright's Fst, Weir and Cocker- 
ham's Fst and PCA. The larger the estimated value, the 
more informative the marker is at genetically discrimi- 
nating the sampled populations. All analyses were con- 
ducted in the R statistical environment [41]. 

Delta 

One of the most commonly used measures of marker 
informativeness is delta [25]. For a biallelic marker the
delta value is given by $|p_{Ai} - p_{Aj}|$, where $p_{Ai}$ and $p_{Aj}$
are the frequencies of allele A in the ith and jth populations,
respectively. Delta can only be estimated between
pairs of populations (K = 2). Since K = 17 in this study, 
values were averaged across all pairwise comparisons to 
produce an estimated value for each SNP marker. 
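
The averaging of delta over all pairs of breeds can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the code used in the study (which was run in R): the function name `mean_pairwise_delta` and the example frequencies are hypothetical.

```python
from itertools import combinations

def mean_pairwise_delta(freqs):
    """Average |p_Ai - p_Aj| over all pairs of populations for one biallelic SNP.

    freqs: frequencies of allele A, one value per population.
    """
    pairs = list(combinations(freqs, 2))
    return sum(abs(p_i - p_j) for p_i, p_j in pairs) / len(pairs)

# Hypothetical allele-A frequencies for a single SNP in four breeds.
snp_freqs = [0.10, 0.50, 0.90, 0.50]
delta = mean_pairwise_delta(snp_freqs)
```

With K = 17 breeds the average would run over 136 pairwise comparisons per SNP, and SNPs would then be ranked by decreasing value.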

Fst 

Wright [28] introduced F-statistics to describe the pro- 
portion of genetic diversity within and among popula- 
tions [42]. Wright's Fst statistic has been extended by 
several authors and a preferable statistic based on the 
analysis of variance of allele frequencies is Weir and
Cockerham's (W&C) Fst [29]. For both methods
unbiased estimates of Fst were first calculated over all
populations (global Fst) and on a pairwise basis (pairwise
Fst), with the latter values being averaged over all
pairs to produce an estimated information content value
for each SNP marker.

Wright's Fst

Wright's Fst was estimated as $\mathrm{var}(p)/[\bar{p}(1 - \bar{p})]$, where
$\mathrm{var}(p)$ is the variance of the allele frequency among
breeds and $\bar{p}$ is the mean allele frequency across the
breeds.

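
A minimal Python sketch of this ratio, in its global and averaged pairwise forms, is given below. It is an illustration only: the use of the population variance (dividing by the number of breeds) and the return value of zero for monomorphic markers are assumptions, and the function names are hypothetical.

```python
from itertools import combinations

def wright_fst(freqs):
    """var(p) / (p_bar * (1 - p_bar)) across populations for one biallelic SNP."""
    n = len(freqs)
    p_bar = sum(freqs) / n
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0  # assumption: a marker fixed everywhere carries no information
    # Assumption: population variance (divide by n, not n - 1).
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n
    return var_p / (p_bar * (1.0 - p_bar))

def mean_pairwise_wright_fst(freqs):
    """Average of the two-population Fst over all pairs of populations."""
    pairs = list(combinations(freqs, 2))
    return sum(wright_fst([p_i, p_j]) for p_i, p_j in pairs) / len(pairs)

# Hypothetical allele frequencies for one SNP in three breeds.
example = [0.2, 0.8, 0.5]
global_fst = wright_fst(example)
pairwise_fst = mean_pairwise_wright_fst(example)
```

The global value uses all breeds at once, whereas the pairwise value weights every pair of breeds equally.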
W&C's Fst

Unbiased estimates of W&C's Fst were calculated as
functions of variance components as detailed in Akey et
al., [43]. Estimated Fst can be negative if alleles drawn
at random from within a population are less similar to
one another than those drawn from different populations
(Fst < 0) [43,44]. In this study negative estimates of Fst
were left unchanged.

Principal Component Analysis (PCA) 

PCA is a statistical technique that can be used to reduce 
the dimension of a multivariate dataset. The original 
variables are linearly transformed by PCA into a set of 
underlying variables ("principal components") ranked in 
terms of their variance, such that most of the original 
variability may be contained in a smaller number of 
variables. Each new variable has an associated eigenvalue 
that measures the respective amount of explained var- 
iance. The coefficients ("loadings") used in the linear 
transformation of the original variables into new vari- 
ables generate the proportion of variance that a variable 
contributes to a given principal component. 

PCA was performed following Paschou et al., [24], but 
on the breed-specific allele frequency matrix rather than 
the individual genotypes. To determine which principal 
components were significant, 100 random matrices were 
created by sampling with replacement allele frequencies 
within each SNP marker across all breeds. The first 
eight principal components for the actual data contained 
more information than in the randomly generated com- 
ponents (i.e., their eigenvalues were greater) and there- 
fore the first eight principal components were used to 
calculate marker informativeness. The loadings for each 
SNP marker were squared and summed over the eight 
significant principal components to produce an estimate 
of informativeness [24]. 
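
The PCA-based score can be sketched in Python with NumPy as follows. This is a hedged illustration rather than the study's implementation: the principal components are obtained from a singular value decomposition of the column-centred breed-by-SNP frequency matrix, squared singular values stand in for eigenvalues (they are proportional to them), and the function names and the random example matrix are hypothetical.

```python
import numpy as np

def pca_informativeness(freq, n_components):
    """Sum of squared loadings per SNP over the leading principal components.

    freq: array of shape (n_breeds, n_snps) holding allele frequencies.
    """
    centred = freq - freq.mean(axis=0)            # centre each SNP across breeds
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    loadings = vt[:n_components]                  # rows: components, columns: SNPs
    return (loadings ** 2).sum(axis=0)

def null_spectra(freq, n_random, rng):
    """Squared singular values of matrices resampled within each SNP.

    Each random matrix is built by sampling, with replacement, the
    frequencies of every SNP across breeds.
    """
    n_breeds, n_snps = freq.shape
    spectra = []
    for _ in range(n_random):
        idx = rng.integers(0, n_breeds, size=(n_breeds, n_snps))
        resampled = np.take_along_axis(freq, idx, axis=0)
        s = np.linalg.svd(resampled - resampled.mean(axis=0), compute_uv=False)
        spectra.append(s ** 2)
    return np.array(spectra)

rng = np.random.default_rng(0)
freq = rng.random((5, 20))                        # hypothetical 5 breeds x 20 SNPs
scores = pca_informativeness(freq, n_components=2)
null = null_spectra(freq, n_random=10, rng=rng)
```

Components whose observed spectrum exceeds the resampled spectra would be treated as significant; the exact comparison rule is not specified here and is left open.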

Individual Assignment Analysis 

Several genetic assignment approaches are available 
[6,7,37]. The Bayesian implementation developed by 
Rannala and Mountain [6] has been found to be more 
effective at individual assignment than other methods 
[37]. However, the method of Paetkau et al., [7] is 



equally effective at individual assignment when the 
levels of genetic differentiation between reference popu- 
lations are high [37]. Comparison of the two methods 
for a subset of cattle breeds from the current study 
revealed similar performance levels (results not shown). 
Consequently, the method of Paetkau et al., [7] was 
employed as it is easier to implement than that of Ran- 
nala & Mountain [6] and is most frequently employed 
in empirical studies. 

Allele frequencies of zero were replaced by a value of
1 × 10^-5 because log(0) is not defined [7]. Likewise, if an
observed allele frequency was 1, it was replaced by a
value of 0.99999.

Genotype likelihoods were calculated for each individual
in each reference population based on the observed
allele frequencies for each marker. Let $p_{ijk}$ denote the
frequency of the kth allele (k = 1, 2) at the jth locus (j =
1 .. J) in the ith population (i = 1 .. I). Let $g_{jkk'}$ denote an
individual's diploid genotype at the jth locus, and let the
Mendelian transmission probability of $g_{jkk'}$ arising in the
ith population be $T(g_{jkk'}|i)$:

$$T(g_{jkk'}|i) = \begin{cases} p_{ijk}^{2} & \text{if } k = k' \\ 2\,p_{ijk}\,p_{ijk'} & \text{if } k \neq k' \end{cases}$$

where a genotype is homozygous if k = k' and heterozygous
otherwise, under the assumption of random
union of gametes. Next, let g denote an individual's multilocus
genotype. The likelihood of an individual diploid
genotype occurring in a particular population, $T(g|i)$,
was estimated as above, as the square of the observed
allele frequency for homozygotes or twice the product
of the two allele frequencies for heterozygotes. Under
the assumption of independence between the J loci

$$T(g|i) = \prod_{j=1}^{J} T(g_{jkk'}|i)$$

and

$$\log_{10}(T(g|i)) = \sum_{j=1}^{J} \log_{10}(T(g_{jkk'}|i)).$$
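
The multilocus log-likelihood can be written as a short Python sketch for biallelic markers. It is illustrative only: genotypes are coded, by assumption, as the number of copies of allele 1 (0, 1 or 2), and the clipping constant `eps` is an assumed value chosen to mirror the 0.99999 upper bound mentioned above.

```python
import math

def locus_likelihood(genotype, p):
    """Probability of a diploid genotype under random union of gametes.

    genotype: copies of allele 1 carried by the individual (0, 1 or 2).
    p: frequency of allele 1 in the population.
    """
    q = 1.0 - p
    if genotype == 2:
        return p * p          # homozygous for allele 1
    if genotype == 1:
        return 2.0 * p * q    # heterozygous
    return q * q              # homozygous for allele 2

def log10_likelihood(genotypes, freqs, eps=1e-5):
    """Sum of per-locus log10 likelihoods, assuming independent loci."""
    total = 0.0
    for g, p in zip(genotypes, freqs):
        p = min(max(p, eps), 1.0 - eps)   # keep frequencies away from 0 and 1
        total += math.log10(locus_likelihood(g, p))
    return total

# Hypothetical individual typed at three SNPs, scored against one breed.
example_ll = log10_likelihood([2, 1, 0], [0.5, 0.5, 0.5])
```

Repeating the calculation with each reference population's allele frequencies gives one log-likelihood per breed for every individual.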

To assess the performance of the assignment proce- 
dure, log-likelihood ratios (LLR) were calculated by 
comparing the likelihood of an individual being assigned
to its population of origin ($i_A$) and the likelihood of it being
assigned to another population ($i_B$):

$$\mathrm{LLR} = \log_{10}(T(g|i_A)) - \log_{10}(T(g|i_B)).$$

Different stringency thresholds can be applied as con- 
fidence levels of assignment precision. Four stringency 
levels are commonly used: LLR > 0, LLR > 1, LLR > 2 
and LLR > 3 [4,25,26,34]. The LLR > 1, LLR > 2 and LLR > 3
levels mean that a multilocus genotype has to be 10, 100
or 1000 times, respectively, more likely in one
population than in any other. The LLR > 0 level requires only
that the genotype be more likely in one population
than in any other. The correct assignment of an individual
genotype to its known origin occurred when the calculated
LLR was greater than the selected stringency level.
If the LLR was lower than the selected stringency level,
the individual genotype failed to be assigned to its origin
and was instead assigned to the reference population
that yielded the highest overall log-likelihood.
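
The stringency test can be sketched in Python as below. The sketch assumes that the LLR is taken against the best-scoring alternative population, which is one reading of "more likely in one population than any other"; the function names and the log-likelihood values are hypothetical.

```python
def llr_against_best_alternative(loglik_by_breed, origin):
    """log10 likelihood in the breed of origin minus the best other breed."""
    origin_ll = loglik_by_breed[origin]
    best_other = max(v for b, v in loglik_by_breed.items() if b != origin)
    return origin_ll - best_other

def correct_at_thresholds(llr, thresholds=(0, 1, 2, 3)):
    """Whether the individual counts as correctly assigned at each stringency level."""
    return {t: llr > t for t in thresholds}

# Hypothetical log10 likelihoods for one individual of known Jersey origin.
loglik = {"Jersey": -40.2, "Guernsey": -42.7, "Holstein": -45.0}
llr = llr_against_best_alternative(loglik, "Jersey")
outcome = correct_at_thresholds(llr)
```

In this example the genotype is roughly 300 times more likely in the breed of origin than in the next best breed, so it passes the LLR > 2 level but not LLR > 3.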

To obtain an estimate of the number of SNP markers 
required to achieve 90%, 95% and 98% correct assign- 
ment success of the 384 individual genotypes for each of 
the selection methods, at each of the 4 threshold levels, 
a non-linear regression model was fitted to the curves of 
correct assignment percentage against cumulative markers.
An asymptotic regression model ($y = a + b\,e^{cx}$,
where parameter a represents the value of the asymptote,
parameter b represents the difference between the
value of y when x = 0 and the upper asymptote and
parameter c represents the natural logarithm of the rate
of exponential increase) was found to best fit the data.
When a > 0, b < 0 and c < 0 the model represents the
law of diminishing returns in which the rate of increase
of y declines with successive equal increments of x.
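
Once the curve has been fitted, the number of markers needed for a target success rate follows by inverting the model, x = ln((y - a)/b)/c. The Python sketch below shows only this inversion step; the parameter values are hypothetical and do not come from the fitted curves in this study.

```python
import math

def asymptotic(x, a, b, c):
    """Asymptotic regression curve y = a + b * exp(c * x)."""
    return a + b * math.exp(c * x)

def markers_needed(target, a, b, c):
    """Solve a + b * exp(c * x) = target for x.

    Returns None when the target is at or above the asymptote a and can
    therefore never be reached (assumes b < 0 and c < 0).
    """
    ratio = (target - a) / b
    if ratio <= 0:
        return None
    return max(0.0, math.log(ratio) / c)

# Hypothetical fitted parameters: asymptote at 99% correct assignment.
a, b, c = 99.0, -80.0, -0.03
x95 = markers_needed(95.0, a, b, c)
```

A target above the asymptote returns None, which corresponds to a success rate that no number of markers would achieve under the fitted curve.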

To test whether the level of genetic differentiation of a 
breed corresponded to the power of assignment, a 
Spearman's rank correlation was calculated between the 
percentage of correctly assigned individuals for the 20 
top ranked SNP markers for each breed (selection
method = pairwise Wright's Fst, LLR > 0) and the average
Fst for each breed (pairwise Wright's Fst values
across all breeds, based on 40,843 SNP markers, averaged
to provide an estimate for each breed).
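
For reference, Spearman's rank correlation is the Pearson correlation of the ranked values. A minimal pure-Python sketch follows; the per-breed values are invented for illustration and are not results from this study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(rx, ry))
    sx = sum((u - mx) ** 2 for u in rx) ** 0.5
    sy = sum((v - my) ** 2 for v in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-breed average Fst and percentage correctly assigned.
mean_fst = [0.05, 0.08, 0.12, 0.15]
pct_correct = [60.0, 75.0, 90.0, 100.0]
rho = spearman(mean_fst, pct_correct)
```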

A negative control to individual assignment analysis 
was applied by analysing 20 sets of 400 randomly 
selected SNPs. The average individual assignment suc- 
cess was estimated across the 20 random SNP sets at 
the stringency level LLR > 3. 

In order to evaluate the power of assignment for sam- 
ples of unknown origin, the individual assignment analy- 
sis was evaluated by cross-validation whereby a training 
sample was used to identify the informative loci and a 
holdout sample from each of the breeds was used to 
test the power of the resulting panel and the reference 
training sample. For breeds with a reference sample size
of > 50 (Table 1) the holdout sample comprised all the
individuals to be assigned (those in column n); these
were removed from their respective reference breed and
allele frequencies of the reference breeds were re-estimated.
For breeds with a reference sample size of < 50
(Table 1) five randomly chosen individuals from those
assigned in the main analysis (those in column n)
were designated as the holdout sample; these were
removed from their respective reference breed and allele
frequencies were re-estimated. The individual assignment
analysis was repeated with the new training samples
and the holdout samples.
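
The re-estimation step can be sketched in Python as follows. The sketch assumes that individual genotypes, coded as counts of allele A (0, 1 or 2), are available for every animal in a reference breed; the function names and the toy data are hypothetical.

```python
def allele_freqs(genotypes):
    """Allele-A frequency per SNP from allele counts (0, 1 or 2) per individual."""
    n = len(genotypes)
    n_snps = len(genotypes[0])
    return [sum(ind[j] for ind in genotypes) / (2 * n) for j in range(n_snps)]

def split_holdout(genotypes, holdout_idx):
    """Separate holdout individuals from the remaining training individuals."""
    holdout_set = set(holdout_idx)
    training = [g for i, g in enumerate(genotypes) if i not in holdout_set]
    holdout = [genotypes[i] for i in holdout_idx]
    return training, holdout

# Hypothetical breed with four individuals typed at two SNPs.
breed = [[2, 0], [1, 1], [0, 2], [2, 1]]
training, holdout = split_holdout(breed, holdout_idx=[3])
full_freqs = allele_freqs(breed)          # frequencies before removing the holdout
train_freqs = allele_freqs(training)      # re-estimated without the holdout animal
```

Marker ranking and assignment would then be repeated with `train_freqs` as the reference, and the holdout genotypes assigned as if their origin were unknown.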

Additional material 



Additional file 1: Supplemental materials. Figure S1: A boxplot of the
observed breed allele frequencies for the top ranked 50 SNP markers for
each selection method. Figure S2: A plot of the percentage assignment
success with cumulative number of top-ranked SNP markers at the 4
stringency threshold levels. The results of this individual assignment test
are for the training set and hold-out set where the selection method
implemented was pairwise Wright's Fst. Table S1: Type I (false positives)
and II errors (false negatives). The table details the error rates that
occurred in the individual assignment analysis, using pairwise Wright's
Fst at the lowest stringency threshold level (LLR > 0).

Additional file 2: Table S2. Top 500 SNP markers. The genetic markers 
are ranked by decreasing informativeness and the corresponding SNP 
discovery methods are listed with each SNP marker. 
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